Abstract

In this paper, we present a Byzantine fault tolerant dis- 
tributed commit protocol for transactions running over un- 
trusted networks. The traditional two-phase commit proto- 
col is enhanced by replicating the coordinator and by run- 
ning a Byzantine agreement algorithm among the coordi- 
nator replicas. Our protocol can tolerate Byzantine faults 
at the coordinator replicas and a subset of malicious faults 
at the participants. A decision certificate, which includes a
set of registration records and a set of votes from participants,
is used to help the coordinator replicas reach
a Byzantine agreement on the outcome of each transaction.
The certificate also limits the ways in which a faulty replica
can cause non-atomic termination of transactions or
semantically incorrect transaction outcomes.

Keywords: Distributed Transaction, Two Phase Commit, 
Fault Tolerance, Byzantine Agreement, Web Services 

1. Introduction 

The two-phase commit (2PC) protocol [8] is a standard
distributed commit protocol [12] for distributed transactions.
The 2PC protocol is designed with the assumptions
that the coordinator and the participants are subject only to 
benign faults, and the coordinator can be recovered quickly 
if it fails. Consequently, the 2PC protocol does not work 
if the coordinator is subject to arbitrary faults (also known 
as Byzantine faults [10]) because a faulty coordinator might 
send conflicting decisions to different participants. Unfortunately,
with more and more distributed transactions running
over the untrusted Internet, driven by the need for business
integration and collaboration and enabled by the latest
Web-based technologies such as Web services, Byzantine
faults are a realistic threat that cannot be ignored.

This problem was first addressed by Mohan et al. in [11] by
integrating Byzantine agreement and the 2PC protocol. The
basic idea is to replace the second phase of the 2PC protocol
with a Byzantine agreement among the coordinator,
the participants, and some redundant nodes within the root 
cluster (where the root coordinator resides). This prevents 
the coordinator from disseminating conflicting transaction 
outcomes to different participants without being detected. 
However, this approach has a number of deficiencies. First, 
it requires all members of the root cluster, including partici- 
pants, to reach a Byzantine agreement for each transaction. 
This would incur very high overhead if the size of the cluster 
is large. Second, it does not offer Byzantine fault tolerance 
protection for subordinate coordinators or participants out- 
side the root cluster. Third, it requires the participants in the 
root cluster to know all other participants in the same clus- 
ter, which prevents dynamic propagation of transactions. In 
general, only the coordinator should have the knowledge of 
the participants set for each transaction. These problems 
prevent this approach from being used in practical systems. 

Rothermel et al. [13] addressed the challenges of ensuring
atomic distributed commit in open systems where
participants (which may also serve as subordinate coordinators)
may be compromised. However, [13] assumes
that the root coordinator is trusted, which limits its usefulness.
Garcia-Molina et al. [6] discussed the circumstances
when Byzantine agreement is needed for distributed trans- 
action processing. Gray [7] compared the problems of dis- 
tributed commit and Byzantine agreement, and provided in- 
sight on the commonality and differences between the two 
paradigms. 

In this paper, we carefully analyze the threats to atomic
commitment of distributed transactions and evaluate strategies
to mitigate such threats. We choose to use a Byzantine
agreement algorithm only among the coordinator replicas,
which avoids the problems in [11]. An obvious candidate
for the Byzantine agreement algorithm is the Byzantine
fault tolerance (BFT) algorithm described in [5] because
of its efficiency. However, the BFT algorithm is designed 
to ensure totally ordered atomic multicast for requests to a 
replicated stateful server. We made a number of modifica- 
tions to the algorithm so that it fits the problem of atomic 
distributed commit. The most crucial change is made to the
first phase of the BFT algorithm, where the primary coordinator
replica is required to use a decision certificate, which
is a collection of the registration records and the votes it 
has collected from the participants, to back its decision on a 
transaction's outcome. The use of such a certificate is essen- 
tial to enable a correct backup coordinator replica to verify 
the primary's proposal. This also limits the methods that a 
faulty replica can use to hinder atomic distributed commit 
of a transaction. 

We integrated our Byzantine fault tolerant distributed 
commit (BFTDC) protocol with Kandula, a well-known 
open source distributed commit framework for Web ser- 
vices [2]. The framework is an implementation of the Web 
Services Atomic Transaction Specification (WS-AT) [4].
The measurements show that our protocol incurs only mod- 
erate runtime overhead during normal operations. 

2. Background 

2.1. Distributed Transactions 

A distributed transaction is a transaction that spans 
across multiple sites over a computer network. It should 
maintain the same ACID properties [8] as a local transaction
does. One of the most interesting issues for distributed
transactions is how to guarantee atomicity, i.e., either all op- 
erations of the transaction succeed in which case the trans- 
action commits, or none of the operations is carried out in 
which case the transaction aborts. 

The middleware supporting distributed transactions is
often called a transaction processing monitor (or TP monitor
for short). One of the main services provided by a TP
monitor is a distributed commit service, which guarantees 
the atomic termination of distributed transactions. In gen- 
eral, the distributed commit service is implemented by the 
2PC protocol, a standard distributed commit protocol [12].

According to the 2PC protocol, a distributed transaction 
is modelled to contain one coordinator and a number of par- 
ticipants. A distributed transaction is initiated by one of the 
participants, which is referred to as the initiator. The coor- 
dinator is created when the transaction is activated by the 
initiator. All participants are required to register with the 
coordinator when they get involved with the transaction. As 
the name suggests, the 2PC protocol commits a transaction 
in two phases. During the first phase (also called prepare 
phase), a request is disseminated by the coordinator to all 
participants so that they can prepare to commit the trans- 
action. If a participant is able to commit the transaction, it 
prepares the transaction for commitment and responds with 
a "prepared" vote. Otherwise, it votes "aborted". When 
a participant responds with a "prepared" vote, it enters a
"ready" state. Such a participant must be prepared to ei- 
ther commit or abort the transaction. A participant that has 
not sent a "prepared" vote can unilaterally abort the trans- 
action. When the coordinator has received votes from every 
participant, or a pre-defined timeout has occurred, it starts 
the second phase by notifying the participants of the outcome
of the transaction. The coordinator decides to commit a transaction only
if it has received the "prepared" vote from every participant 
during the first phase. It aborts the transaction otherwise. 
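The coordinator's decision rule described above can be illustrated with a minimal Python sketch. The names Vote and decide_2pc are hypothetical and are not taken from any 2PC implementation; a missing vote is assumed to model a timeout.

```python
from enum import Enum


class Vote(Enum):
    PREPARED = "prepared"
    ABORTED = "aborted"


def decide_2pc(participants, votes):
    """Return "commit" only if every registered participant voted PREPARED.

    `votes` maps a participant id to its Vote. A participant with no
    entry is treated as having timed out (an assumption of this sketch).
    """
    for p in participants:
        if votes.get(p) is not Vote.PREPARED:
            return "abort"
    return "commit"
```

A single missing or "aborted" vote is therefore enough to abort the transaction, which is the behavior the second phase relies on.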

2.2. Byzantine Fault Tolerance 

Byzantine fault tolerance refers to the capability of a sys- 
tem to tolerate Byzantine faults. It can be achieved by repli- 
cating the server and by ensuring all server replicas receive 
the same input in the same order. The latter means that the 
server replicas must reach an agreement on the input despite 
Byzantine faulty replicas and clients. Such an agreement is 
often referred to as Byzantine agreement [10].

Byzantine agreement algorithms had been too expensive 
to be practical until Castro and Liskov invented the BFT 
algorithm mentioned earlier [5]. The BFT algorithm is executed
by a set of 3f + 1 replicas to tolerate f Byzantine
faulty replicas. One of the replicas is designated as the pri- 
mary while the rest are backups. The normal operation of 
the BFT algorithm involves three phases. During the first 
phase (called pre-prepare phase), the primary multicasts a 
pre-prepare message containing the client's request, the cur- 
rent view and a sequence number assigned to the request to 
all backups. A backup verifies the request message and the 
ordering information. If the backup accepts the message, it 
multicasts to all other replicas a prepare message contain- 
ing the ordering information and the digest of the request 
being ordered. This starts the second phase, i.e., the prepare
phase. A replica waits until it has collected 2f matching
prepare messages from different replicas before it multicasts
a commit message to other replicas, which starts the
third phase (i.e., commit phase). The commit phase ends
when a replica has received 2f matching commit messages
from other replicas. At this point, the request message has 
been totally ordered and it is ready to be delivered to the 
server application. 
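The thresholds in this description can be summarized in a small Python sketch. The function bft_phase and its arguments are hypothetical names used for illustration only. The sketch assumes the replica has already accepted the pre-prepare message and that the counts cover only matching messages from different replicas.

```python
def bft_phase(f, matching_prepares, matching_commits):
    """Classify how far one request has progressed at a single replica.

    f                  -- number of Byzantine faulty replicas tolerated
    matching_prepares  -- matching prepare messages collected so far
    matching_commits   -- matching commit messages from other replicas
    """
    quorum = 2 * f  # threshold stated in the text for both phases
    if matching_prepares < quorum:
        return "pre-prepared"
    if matching_commits < quorum:
        return "prepared"
    return "committed"
```

With f = 1 (four replicas), two matching prepare messages followed by two matching commit messages from other replicas are enough for the request to be delivered.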

If the primary or the client is faulty, a Byzantine agree- 
ment on the order of a request might not be reached, in 
which case, a new view is initiated, triggered by a time- 
out on the current view. A different primary is designated 
in a round-robin fashion for each new view installed. 

3. BFT Distributed Commit 

3.1. System Models 

We consider transactional client/server applications sup- 
ported by an object-based TP monitor such as the WS-AT 
conformant framework [2] used in our implementation. For
simplicity, we assume a flat distributed transaction model. 
We assume that for each transaction, a distinct coordinator 
is created. The lifespan of the coordinator is the same as the 
transaction it coordinates. 

All transactions are started and terminated by the initia- 
tor. The initiator also propagates the transaction to other 
participants. The distributed commit protocol is started for 
a transaction when a commit/abort request is received from
the initiator. The initiator is regarded as a special participant.
In later discussions we do not distinguish between the
initiator and other participants unless it is necessary to do so.

When considering the safety of our distributed com- 
mit protocol, we use an asynchronous distributed system 
model. However, to ensure liveness, certain synchrony must 
be assumed. Similar to [5], we assume that the message
transmission and processing delay has an asymptotic upper 
bound. This bound is dynamically explored in the adapted 
Byzantine agreement algorithm in that each time a view 
change occurs, the timeout for the new view is doubled. 

We assume that the transaction coordinator runs sepa- 
rately from the participants, and it is replicated. For sim- 
plicity, we assume that the participants are not replicated. 
We assume that 3f + 1 coordinator replicas are available,
among which at most f can be faulty during a transaction.
There is no limit on the number of faulty participants. Similar
to [5], each coordinator replica is assigned a unique id i,
where i varies from 0 to 3f. For view v, the replica whose id
i satisfies i = v mod (3f + 1) serves as the primary.
The view starts from 0. For each view change, the view 
number is increased by one and a new primary is selected. 
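The primary selection rule, together with the timeout doubling described earlier in this section, can be sketched in a few lines of Python. The function names and the base_timeout parameter are assumptions made for illustration; the text does not prescribe a concrete initial timeout.

```python
def primary_for_view(view, f):
    """Id of the primary in the given view, for 3f + 1 replicas numbered 0..3f."""
    return view % (3 * f + 1)


def view_timeout(base_timeout, view):
    """Timeout used in the given view, assuming view 0 uses base_timeout
    and the timeout doubles at every view change."""
    return base_timeout * (2 ** view)
```

For f = 1, views 0 through 3 are led by replicas 0, 1, 2 and 3 in turn, and view 4 wraps around to replica 0.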

In this paper, we call a coordinator replica correct if it 
does not fail during the distributed commit for the trans- 
action under consideration, i.e., it faithfully executes ac- 
cording to the protocol prescribed from the start to the end. 
However, we call a participant correct if it is not Byzantine 
faulty, i.e., it may be subject to typical non-malicious faults 
such as crash faults or performance faults. 

The coordinator replicas are subject to Byzantine faults, 
i.e., a Byzantine faulty replica can fail arbitrarily. For participants,
however, only a subset of faulty behaviors is tolerated,
such as a faulty participant sending conflicting votes
to different coordinator replicas. Some forms of Byzantine
behavior by participants cannot be addressed by the distributed
commit protocol (see footnote 1).

For the initiator, we further limit its Byzantine faulty
behaviors. In particular, it does not exclude any correct par- 
ticipant from the scope of the transaction, or include any 
participant that has not registered properly with the coordi- 
nator replicas, as discussed below. 

To ensure atomic termination of a distributed transaction, 
it is essential that all correct coordinator replicas agree on 
the set of participants involved in the transaction. In this 
work, we defer the Byzantine agreement on the participants 
set until the distributed commit stage and combine it with 
that for the transaction outcome. To facilitate this optimiza- 
tion, we need to make the following additional assumptions. 

We assume that a proper authentication mechanism is in
place to prevent a Byzantine faulty process from illegally
registering itself as a participant at correct coordinator
replicas. Furthermore, we assume that a correct participant
registers with f + 1 or more correct coordinator replicas
before it sends a reply to the initiator when the transaction
is propagated to this participant with a request coming from
the initiator. If a correct participant crashes before the
transaction is propagated to it, or before it finishes
registering with the coordinator replicas, either no reply is
sent back to the initiator, or an exception is thrown back to
the initiator. As a result, the initiator should decide to
abort the transaction. The interaction pattern among the
initiator, participants and the coordinator is identical to that
described in the WS-AT specification [4], except that the
coordinator is replicated in this work.

1 For example, a Byzantine faulty participant can vote to commit a
transaction while actually aborting it, and vice versa.

All messages between the coordinator and the participants
are digitally signed. We assume that the coordinator
replicas and the participants each have a public/private key
pair. The public keys of the participants are known to all
coordinator replicas, and vice versa, while each private key is
kept secret by its owner. We assume that the adversaries
have limited computing power so that they cannot break
the encryption and digital signatures of correct coordinator
replicas.

3.2. BFTDC Protocol 

Figure 1 shows the pseudo-code of our Byzantine
fault tolerant distributed commit protocol. Compared with
the 2PC protocol, there are two main differences:

- At the coordinator side, an additional phase of Byzan- 
tine agreement is needed for the coordinator replicas 
to reach a consensus on the outcome of the transaction, 
before they notify the participants. 

- At the participant side, a decision (commit or abort
request) from a coordinator replica is queued until at
least f + 1 identical decision messages have been received
from different coordinator replicas, unless the participant
unilaterally aborts the transaction. This is to make sure
that at least one of the decision messages comes from a
correct coordinator replica.

The distributed commit for a transaction starts when a 
coordinator replica receives a commit request from the ini- 
tiator. If the coordinator replica receives an abort request 
from the initiator, it skips the first phase of the distributed 
commit. In any case, a Byzantine agreement is conducted 
on the decision regarding the transaction's outcome. 

The operations of each coordinator replica are defined in
the BFTDistributedCommit() method in Fig. 1. During the
prepare phase, a coordinator replica sends a prepare request 
to every participant in the transaction. The prepare request 
is piggybacked with a prepare certificate, which contains 
the commit request sent (and signed) by the initiator. 

When a participant receives a prepare request from a
coordinator replica, it verifies the correctness of the signature
of the message and the prepare certificate (if the participant
does not know the initiator's public key, this step is
skipped). The prepare request is discarded if any of the
verification steps fails. Even though the check for a prepare
certificate is not essential to the correctness of our distributed
commit protocol, it nevertheless can prevent a faulty
coordinator replica from instructing some participants to prepare
a transaction, even after the initiator has requested to abort
the transaction.

Method: BFTDistributedCommit(CommitRequest)
begin
    PrepareCert := CommitRequest;
    Append PrepareCert to PrepareRequest;
    Multicast PrepareRequest;
    VoteLog := CollectVotes();
    Add VoteLog to DecisionCert;
    decision := ByzantineAgreement(DecisionCert);
    if decision = Commit then Multicast CommitRequest;
    else Multicast AbortRequest;
    Return decision;
end

Method: PrepareTransaction(PrepareRequest)
begin
    if VerifySignature(PrepareRequest) = false then
        Discard PrepareRequest and return;
    if HasPrepareCert(PrepareRequest) = false then
        Discard PrepareRequest and return;
    if P is willing to commit T then
        Log(<Prepared T>) to stable storage;
        Send "prepared" to coordinator;
    else
        Log(<Abort T>); Send "aborted" to coordinator;
end

Method: CommitTransaction(CommitRequest)
begin
    if VerifySignature(CommitRequest) = false then
        Discard CommitRequest and return;
    Append CommitRequest to DecisionLog;
    if CanMakeDecision(commit, DecisionLog) then
        Log(<Commit T>) to stable storage;
        Send "committed" to coordinator;
end

Method: AbortTransaction(AbortRequest)
begin
    if VerifySignature(AbortRequest) = false then
        Discard AbortRequest and return;
    Append AbortRequest to DecisionLog;
    if CanMakeDecision(abort, DecisionLog) then
        Log(<Abort T>); Abort T locally;
        Send "aborted" to coordinator;
end

Method: CanMakeDecision(decision, DecisionLog)
begin
    NumOfDecisions := 0;
    foreach Message in DecisionLog do
        if GetDecision(Message) = decision then
            NumOfDecisions++;
    if NumOfDecisions >= f+1 then Return true;
    else Return false;
end

Figure 1. Pseudo-code for our Byzantine fault
tolerant distributed commit protocol.

At the end of the prepare phase, all correct coordinator 
replicas engage in an additional round for them to reach 
a Byzantine agreement on the outcome of the transaction. 
The Byzantine agreement algorithm used in this phase is 
elaborated in Section 3.3.

When a participant receives a commit request from a
coordinator replica, it commits the transaction only if it has
received the same decision from f other replicas, so that at
least one of the f + 1 matching decisions comes from a correct
replica. The handling of an abort request is similar.
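As an illustration of the participant-side rule, the following Python sketch queues decisions until f + 1 matching ones have arrived. The class DecisionQueue is hypothetical. It assumes that signatures have already been verified, and it counts at most one decision per replica id (keeping the first one received), which is an assumption of this sketch and not a detail given in the text.

```python
class DecisionQueue:
    """Queues decision messages for one transaction at a participant."""

    def __init__(self, f):
        self.f = f
        self.decision_by_replica = {}

    def record(self, replica_id, decision):
        """Store a verified decision and report whether it can be acted on.

        Returns True once f + 1 different replicas have sent `decision`.
        """
        # Keep only the first decision from each replica so that a faulty
        # replica cannot be counted more than once.
        self.decision_by_replica.setdefault(replica_id, decision)
        matching = sum(
            1 for d in self.decision_by_replica.values() if d == decision
        )
        return matching >= self.f + 1
```

For f = 1, a "commit" from replica 0 alone is not enough, even if it is repeated. A second "commit" from a different replica allows the participant to commit.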

3.3. Byzantine Agreement Algorithm 

The Byzantine agreement algorithm used in the BFTDC 
protocol is adapted from the BFT algorithm by Castro and 
Liskov [5]. To avoid possible confusion with the terms used
to refer to the distributed commit protocol, the three phases 
during normal operations are referred to as ba-pre-prepare, 
ba-prepare, and ba-commit. Our algorithm differs from the 
BFT algorithm in a number of places due to different objec- 
tives. The BFT algorithm is used for server replicas to agree 
on the total ordering of the requests received, while our al- 
gorithm is used for the coordinator replicas to agree on the 
outcome (and participants set) of each transaction. In our al- 
gorithm, the ba-pre-prepare message is used to bind a deci- 
sion (to commit or abort) with the transaction under concern 
(represented by a unique transaction id). In [5], the corresponding
pre-prepare message is used to bind a request with an execution
order (represented by a unique sequence number). Further- 
more, for distributed commit, an instance of our algorithm 
is created and executed for each transaction. When there are 
multiple concurrent transactions, multiple instances of our 
algorithm are running concurrently and independently from 
each other (the relative ordering of the distributed commit 
for different transactions is not important). In [5], however,
a single instance of the BFT algorithm is used for all re- 
quests to be ordered. 

When a replica completes the prepare phase of the dis- 
tributed commit for a transaction, an instance of our Byzan- 
tine agreement algorithm is created. The algorithm starts 
with the ba-pre-prepare phase. During this phase, the primary
p sends a ba-pre-prepare message including its decision
certificate to all other replicas. The ba-pre-prepare
message has the form <BA-PRE-PREPARE, v, t, o, C>σ_p,
where v is the current view number, t is the transaction
id, o is the proposed transaction outcome (i.e., commit or
abort), C is the decision certificate, and σ_p is the signature
of the message signed by the primary. The decision certificate
contains a collection of records, one for each participant.
The record for a participant j contains a signed registration
R_j = (t, j)σ_j and a signed vote V_j = (t, vote)σ_j
for the transaction t, if a vote from j has been received by 
the primary. The transaction id is included in each registra- 
tion and vote record so that a faulty primary cannot reuse 
an obsolete registration or vote record to force a transac- 
tion outcome against the will of some correct participants 
(which may lead to non-atomic transaction commit). 
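One possible in-memory representation of the ba-pre-prepare message and its decision certificate is sketched below in Python. All class and field names are hypothetical. Signatures are carried as opaque bytes and are not verified here. The helper only shows why embedding the transaction id in every record lets a replica reject records taken from another transaction.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Registration:      # R_j = (t, j) signed by participant j
    txn_id: str
    participant: str
    signature: bytes


@dataclass(frozen=True)
class VoteRecord:        # V_j = (t, vote) signed by participant j
    txn_id: str
    vote: str            # "prepared" or "aborted"
    signature: bytes


@dataclass(frozen=True)
class CertEntry:         # one record per participant
    registration: Registration
    vote: Optional[VoteRecord] = None  # None if no vote was received


@dataclass(frozen=True)
class BaPrePrepare:      # <BA-PRE-PREPARE, v, t, o, C> signed by the primary
    view: int
    txn_id: str
    outcome: str         # "commit" or "abort"
    cert: Tuple[CertEntry, ...]
    signature: bytes


def records_match_transaction(msg):
    """True if every registration and vote record carries msg.txn_id."""
    for entry in msg.cert:
        if entry.registration.txn_id != msg.txn_id:
            return False
        if entry.vote is not None and entry.vote.txn_id != msg.txn_id:
            return False
    return True
```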

A backup accepts a ba-pre-prepare message provided: 

- The message is signed properly by the primary. The 
replica is in view v, and it is handling transaction t. 

- It has not accepted a ba-pre-prepare message for
transaction t in view v. 



- The registration records in C are identical to, or form 
a superset of, the local registration records. 

- Every vote record in C is properly signed by its send- 
ing participant and the transaction identifier in the 
record matches that of the current transaction, and the 
proposed decision o is consistent with the registration 
and vote records. 

Note that a backup does not insist on receiving a decision 
certificate identical to its local copy. This is because a cor- 
rect primary might have received a registration from a par- 
ticipant which the backup has not, or the primary and back- 
ups might have received different votes from some Byzan- 
tine faulty participants, or the primary might have received 
a vote that a backup has not received if the sending partici- 
pant crashed right after it has sent its vote to the primary. 

If the registration records in C form a superset of the lo- 
cal registration records, the backup updates its registration 
records and asks the primary replica for the endpoint ref- 
erence of each missing participant (so that it can send its 
notification to the participant). 
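The registration and consistency conditions checked by a backup can be sketched as follows in Python. The function certificate_acceptable is a hypothetical name. The certificate is simplified to a mapping from participant id to its vote (or None if no vote was received), and signature, view and transaction checks are assumed to have been done already. The text does not spell out when an abort proposal counts as consistent, so this sketch assumes that an abort is consistent whenever at least one "prepared" vote is missing.

```python
def certificate_acceptable(local_registrations, cert_votes, outcome):
    """Check a proposed outcome against a simplified decision certificate.

    local_registrations -- participant ids registered at this backup
    cert_votes          -- dict: participant id -> "prepared", "aborted" or None
    outcome             -- "commit" or "abort", as proposed by the primary
    """
    # The certificate must cover every participant known locally
    # (it may be identical to, or a superset of, the local set).
    if not set(local_registrations) <= set(cert_votes):
        return False
    all_prepared = all(v == "prepared" for v in cert_votes.values())
    if outcome == "commit":
        return all_prepared
    # Assumption of this sketch: abort is consistent only if some
    # participant has not voted "prepared".
    return outcome == "abort" and not all_prepared
```

A certificate that omits a locally registered participant is rejected regardless of the proposed outcome, which is the condition that triggers a view change in the protocol.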

A backup suspects the primary and initiates a view 
change immediately if the ba-pre-prepare message fails the 
verification. Otherwise, the backup accepts the ba-pre-prepare
message. At this point, we say the replica has ba-pre-prepared
for transaction t. It then logs the accepted ba-pre-prepare
message and multicasts a ba-prepare message
with the same decision o as that in the ba-pre-prepare message
(this starts the ba-prepare phase). The ba-prepare message
takes the form <BA-PREPARE, v, t, d, o, i>σ_i, where d
is the digest of the decision certificate C.
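The text does not say how the digest d is computed. As a sketch under stated assumptions, the Python function below hashes a canonical JSON encoding of a simplified certificate with SHA-256. Both the encoding and the hash function are choices made for this illustration, and certificate_digest is a hypothetical name.

```python
import hashlib
import json


def certificate_digest(cert):
    """Digest of a certificate given as a dict: participant id -> vote.

    Keys are sorted so that two replicas holding the same records
    compute the same digest regardless of insertion order.
    """
    canonical = json.dumps(cert, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Sending the digest instead of the full certificate keeps ba-prepare messages small while still binding them to the certificate carried in the ba-pre-prepare message.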

A coordinator replica j accepts a ba-prepare message 
provided: 

- The message is correctly signed by replica i, and 
replica j is in view v and the current transaction is t; 

- The decision o matches that in the ba-pre-prepare mes- 
sage; 

- The digest d matches the digest of the decision certifi- 
cate in the accepted ba-pre-prepare message. 

If a replica has collected 2f matching ba-prepare messages
from different replicas (including the replica's own
ba-prepare message if it is a backup), the replica is said to 
have ba-prepared to make a decision on transaction t. This 
is the end of the ba-prepare phase. 

A ba-prepared replica enters the ba-commit phase by
multicasting a ba-commit message to all other replicas.
The ba-commit message has the form
<BA-COMMIT, v, t, d, o, i>σ_i. The replica i is said to have
ba-committed if it has obtained 2f + 1 matching ba-commit
messages from different replicas (including the message it
has sent). When a replica is ba-committed for transaction t,
it sends the decision o to all participants of transaction t.

2 The term endpoint reference refers to the physical contact information
such as host and port of a process. In Web services, an endpoint reference
typically contains a URL to a service and an identifier used by the service
to locate the specific handler object [9].
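The ba-prepare and ba-commit thresholds can be illustrated with a short Python sketch. The function ba_state is hypothetical. Messages are simplified to tuples (v, t, d, o, i), signatures are assumed to be verified elsewhere, and the replica is assumed to have already accepted a ba-pre-prepare message summarized by `accepted`.

```python
def ba_state(f, accepted, prepares, commits):
    """Progress of one agreement instance at a replica.

    accepted -- (v, t, d, o) from the accepted ba-pre-prepare message
    prepares -- iterable of ba-prepare tuples (v, t, d, o, i)
    commits  -- iterable of ba-commit tuples (v, t, d, o, i)
    """
    def matching_senders(messages):
        # Count each sending replica once, and only for matching content.
        return {m[4] for m in messages if tuple(m[:4]) == tuple(accepted)}

    if len(matching_senders(prepares)) < 2 * f:
        return "ba-pre-prepared"
    if len(matching_senders(commits)) < 2 * f + 1:
        return "ba-prepared"
    return "ba-committed"
```

Messages whose view, transaction id, digest or decision differ from the accepted ba-pre-prepare message do not count towards either threshold.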

If a replica i cannot advance to the ba-committed state
before a timeout expires, it initiates a view change by sending
a view change message to all other replicas. The view change
message has the form <VIEW-CHANGE, v + 1, t, P, i>σ_i, where
P contains information regarding its current state. If the
replica has ba-pre-prepared t in view v, it includes a tuple
<v, t, o, C>. If it has ba-prepared t in view v, it includes
both the tuple <v, t, o, C> and 2f matching ba-prepare
messages from different replicas for t obtained in view v.
If the replica has not ba-pre-prepared t, it includes its own 
decision certificate C. 

A correct replica that has not timed out the current view 
multicasts a view change message only if it is in view v and 
it has received valid view change messages for view v + 1
from f + 1 different replicas. This is to prevent a faulty
replica from inducing unnecessary view changes. A view 
change message is regarded as valid if it is for view v + 1 
and the ba-pre-prepare and ba-prepare information included 
in P, if any, is for transaction t in a view up to v. 
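The rule for joining a view change can be sketched in Python as follows. The function should_send_view_change is hypothetical. View change messages are simplified to (target_view, sender) pairs that are assumed to have passed the validity check described above.

```python
def should_send_view_change(f, current_view, timed_out, view_changes):
    """Decide whether a replica multicasts a view change for current_view + 1.

    view_changes -- iterable of (target_view, sender) pairs already validated
    """
    if timed_out:
        # The replica's own timer expired, so it initiates the view change.
        return True
    senders = {
        sender for target, sender in view_changes if target == current_view + 1
    }
    # Without a local timeout, join only after f + 1 different replicas ask,
    # which guarantees that at least one of them is correct.
    return len(senders) >= f + 1
```

With this threshold, f faulty replicas acting alone cannot force a correct replica to abandon the current view.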

When the primary for view v + 1 receives 2f + 1
valid view change messages for v + 1 (including the one
it has sent or would have sent), it installs the new view
and multicasts a new view message of the form
<NEW-VIEW, v + 1, V, t, o, C> for view v + 1, where V contains
2f + 1 tuples for the view change messages received for
view v + 1. Each tuple has the form <i, d>, where i is
the sending replica, and d is the digest of the view change
message. The proposed decision o for t and the decision
certificate C are determined according to the following
rules:

1. If the new primary has received a view change message
containing a valid ba-prepare record for t, and there is 
no conflicting ba-prepare record, it uses that decision. 

2. Else, the new primary rebuilds a set of registration 
records from the received view change messages. This 
new set may be identical to, or a superset of, the regis- 
tration set known to the new primary prior to the view 
change. The new primary then rebuilds a set of vote 
records in a similar manner. It is possible that conflict- 
ing vote records are found from the same participant 
(i.e., a participant sent a "prepared" vote to one coordinator
replica, while sending an "aborted" vote to
some other replicas), in which case, a decision has to 
be made on the direction of the transaction t. In this 
work, we choose to take the "prepared" vote to maxi- 
mize the commit rate. A new decision certificate will 
be constructed and a decision for t's outcome is pro- 
posed accordingly. They will be included in the new 
view message for view v + 1. 
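The second rule can be illustrated with a short Python sketch. The function rebuild_certificate is hypothetical. Each certificate is simplified to a mapping from participant id to its vote (or None if no vote was received), and the first rule (reusing a decision backed by a valid ba-prepare record) is not covered here.

```python
def rebuild_certificate(certs):
    """Merge the certificates found in view change messages.

    certs -- iterable of dicts: participant id -> "prepared", "aborted" or None
    Returns (merged_certificate, proposed_outcome).
    """
    merged = {}
    for cert in certs:
        for participant, vote in cert.items():
            # Registrations are unioned. A known vote replaces a missing
            # one, and "prepared" wins over a conflicting "aborted" vote.
            if (
                participant not in merged
                or merged[participant] is None
                or vote == "prepared"
            ):
                merged[participant] = vote
    all_prepared = bool(merged) and all(
        v == "prepared" for v in merged.values()
    )
    return merged, ("commit" if all_prepared else "abort")
```

Preferring the "prepared" vote only affects participants that sent conflicting votes, which a correct participant never does.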

When a backup receives the new view message, it verifies
the message by following essentially the same steps used
by the primary. If the replica accepts the new view message,
it may need to retrieve the endpoint references for some par- 
ticipants that it did not receive from other correct replicas. 
When a backup replica has accepted the new view message 
and obtained all missing information, it sends a ba-prepare 
message to all other replicas. The algorithm then proceeds 
according to its normal operations. 

3.4. Informal Proof of Correctness 

We now provide an informal proof of the safety of our 
Byzantine agreement algorithm and the distributed commit 
protocol. Due to space limitation, the proof for liveness is 
omitted. 

Claim 1: If a correct coordinator replica ba-commits 
a transaction t with a commit decision, the registration 
records of all correct participants must have been included 
in the decision certificate, and all such participants must 
have voted to commit the transaction. 

We prove by contradiction. Assume that there exists a
correct participant p whose registration is left out of the
decision certificate. Since a correct coordinator replica has
ba-committed t with a commit decision, it must have accepted
a ba-pre-prepare message and 2f matching ba-prepare messages
from different replicas. This means that a set R1 of
2f + 1 replicas have all accepted the same decision certificate
without the participant p, the initiator has requested
the coordinator replicas to commit t, and every participant
in the registration set has voted to commit the transaction.
This further implies that the initiator has received normal
replies from all participants, including p, to which it has
propagated the current transaction. Because the participant
p is correct and responded to the initiator's request properly,
it must have registered with at least 2f + 1 coordinator
replicas prior to sending its reply to the initiator. Among the
2f + 1 coordinator replicas, at least a set R2 of f + 1 replicas
are correct, i.e., all replicas in R2 are correct and have the
registration record for p prior to the start of the distributed
commit for t. Because the total number of replicas is 3f + 1,
the two sets R1 and R2 must intersect in at least one correct
replica. The correct replica in the intersection either did not
receive the registration from p, or it has accepted a decision
certificate without the registration record for p despite the
fact that it has received the registration from p, which is
impossible. Therefore, all correct participants must have been
included in the decision certificate if any correct replica
ba-committed a transaction with a commit decision.

We next prove that if any correct replica ba-committed
a transaction with a commit decision, all correct participants
must have voted to commit the transaction. Again,
we prove by contradiction. Assume that the above statement
is not true, and a correct participant q has voted to
abort the transaction t. Since we have proved above that
q's registration record must have been included in the
decision certificate, its vote cannot be ignored. Furthermore,
since a correct replica ba-committed t with a commit decision,
the set R1 of 2f + 1 replicas have all accepted the
commit decision. Again, since R1 and R2 must intersect
in at least one correct replica, that replica both accepted the
commit decision and has received the "aborted" vote from
q. This is possible only if the ba-pre-prepare message that
the replica has accepted contains a "prepared" vote from q.
This contradicts the fact that q is a correct participant: a
correct participant never sends conflicting votes to different
coordinator replicas. This concludes our proof for Claim 1.

Claim 2: Our Byzantine agreement algorithm ensures 
that all correct coordinator replicas agree on the same de- 
cision regarding the outcome of a transaction. 

We prove by contradiction. Assume that two correct
replicas i and j reach different decisions for t. Without loss
of generality, assume that i decides to abort t in a view v and
j decides to commit t in a view u.

First, we consider the case when v = u. According to
our algorithm, i must have accepted a ba-pre-prepare message
with an abort decision supported by a decision certificate, and
2f matching ba-prepare messages from different replicas, all in
view v. This means that a set R3 of at least 2f + 1 replicas
have ba-prepared t with an abort decision in view v. Similarly,
replica j must have accepted a ba-pre-prepare message with a
commit decision supported by a decision certificate, and 2f
matching ba-prepare messages from different replicas for
transaction t in the same view v, which means that a set R4 of
at least 2f + 1 replicas have ba-prepared t with a commit
decision in view v. Since there are only 3f + 1 replicas, the
two sets R3 and R4 must intersect in at least f + 1 replicas,
among which at least one is a correct replica. This replica
must have accepted two conflicting ba-pre-prepare messages (one
to commit and the other to abort) in the same view. This
contradicts the fact that it is a correct replica.
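
The counting step used here, and in the proof of claim 1, is that two subsets of sizes a and b drawn from n replicas overlap in at least a + b - n members. The short Python check below is only an illustration of that arithmetic; the function names are hypothetical, and the brute-force enumeration is included merely to confirm the formula for small f.

```python
from itertools import combinations


def min_overlap(n, a, b):
    """Smallest possible intersection of two subsets of sizes a and b drawn from n items."""
    return max(0, a + b - n)


def check_by_enumeration(f):
    """Brute-force the minimum overlap of two (2f + 1)-quorums among 3f + 1 replicas."""
    n, q = 3 * f + 1, 2 * f + 1
    replicas = range(n)
    return min(
        len(set(x) & set(y))
        for x in combinations(replicas, q)
        for y in combinations(replicas, q)
    )


for f in (1, 2):
    n, q = 3 * f + 1, 2 * f + 1
    overlap = min_overlap(n, q, q)  # equals f + 1
    assert overlap == f + 1 == check_by_enumeration(f)
    # With at most f faulty replicas, at least one member of the overlap is correct.
    assert overlap - f >= 1
    # A (2f + 1)-set and an (f + 1)-set of correct replicas share at least one replica.
    assert min_overlap(n, q, f + 1) == 1
```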

Next, we consider the case when view u > v. Since
replica i ba-committed with an abort decision for t in view v,
it must have received 2f + 1 matching ba-commit messages from
different replicas (including the one sent by itself). This
means that a set R5 of 2f + 1 replicas have ba-prepared t in
view v, all with the same decision to abort t. To install a new
view, the primary of the new view must have received view change
messages (including the one it has sent or would have sent) from
a set R6 of 2f + 1 replicas. Similar to the previous argument,
the two sets R5 and R6 intersect in at least f + 1 replicas,
among which at least one must be a correct replica. This replica
would have included in its view change message the decision and
the decision certificate backed by the ba-pre-prepare message
and the 2f matching ba-prepare messages it has received from
other replicas. The primary in the new view, if it is correct,
must have used the decision and decision certificate from this
replica. This should have led all correct replicas to ba-commit
transaction t with an abort decision, which contradicts the
assumption that a correct replica committed t. If the primary is
faulty and did not obey the new view construction rule, we argue
that no correct replica could have accepted the new view
message, let alone have ba-committed t with a commit decision.
Recall that a correct replica should verify the new view message
by following the new view construction rules, just as a correct
primary would do. We have proved above that the 2f + 1 view
change messages must contain one sent by a correct replica with
ba-prepare information for an abort decision. A correct replica
cannot possibly have accepted the new view message sent by the
faulty primary, which contains a conflicting decision. This
contradicts the initial assumption that a correct replica j
committed transaction t in view u. The proof for the case when
v > u is similar. Therefore, claim 2 is correct.
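
The verification step in this argument, where a backup re-runs the new view construction rule instead of trusting the primary, can be sketched as below. This is a simplified Python illustration, not the protocol's actual message handling: the function names and the data layout are hypothetical, each non-empty entry is assumed to have been verified already against its ba-pre-prepare message and 2f matching ba-prepare messages, and the choice of the entry from the highest view follows the usual Castro-Liskov style rule as an assumption.

```python
def required_decision(view_changes, f):
    """Return the decision the new view must carry forward, or None if unconstrained.

    view_changes maps a replica id to (view, decision) if that replica reported
    having ba-prepared the transaction, or to None otherwise.
    """
    if len(view_changes) < 2 * f + 1:
        raise ValueError("need view change messages from 2f + 1 replicas")
    prepared = [entry for entry in view_changes.values() if entry is not None]
    if not prepared:
        return None
    # Assumption: the prepared entry from the highest view takes precedence.
    _, decision = max(prepared, key=lambda entry: entry[0])
    return decision


def accepts_new_view(proposed_decision, view_changes, f):
    """A backup recomputes the required decision and compares it with the proposal."""
    required = required_decision(view_changes, f)
    return required is None or proposed_decision == required
```

With f = 1 and one view change message reporting a ba-prepared abort, a new view message proposing commit is rejected by every correct replica, which mirrors the case of a faulty primary discussed above.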

Claim 3: The BFTDC protocol guarantees atomic termi- 
nation of transactions at all correct participants. 

We prove by contradiction. Assume that a transaction t
commits at a participant p but aborts at another participant q.
According to the criteria indicated in the CommitTransaction()
method shown in the corresponding figure, p commits the
transaction t only if it has received the commit request from at
least f + 1 different coordinator replicas. Since at most f
replicas are faulty, at least one request comes from a correct
replica. Due to claim 1, if any correct replica ba-committed a
transaction with a commit decision, then the registration
records of all correct participants must have been included in
the decision certificate, and all correct participants must have
voted to commit the transaction.
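
The participant-side rule used in this step, acting on a request only after f + 1 different coordinator replicas have sent the same one, can be sketched as follows. The class RequestCollector and its method are hypothetical names introduced for illustration, and message authentication is assumed to happen before on_request is called.

```python
from collections import defaultdict


class RequestCollector:
    """Hypothetical participant-side helper: act only on f + 1 matching requests."""

    def __init__(self, f):
        self.f = f
        self.senders = defaultdict(set)  # decision -> ids of replicas that requested it

    def on_request(self, replica_id, decision):
        """Record a commit/abort request; return the decision once it is safe to act."""
        self.senders[decision].add(replica_id)  # a set ignores duplicate senders
        if len(self.senders[decision]) >= self.f + 1:
            return decision
        return None
```

Because at most f replicas are faulty, any decision returned by on_request is backed by at least one correct replica, and repeated requests from a single faulty replica never reach the threshold on their own.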

On the other hand, since q aborted t, one of the following
two scenarios must be true: (1) q unilaterally aborted t, in
which case it must not have sent a "prepared" vote to any
coordinator replica; (2) q received a prepare request, prepared
t, and sent a "prepared" vote to one or more coordinator
replicas, but it then received an abort request from at least
f + 1 different coordinator replicas.

If the first scenario is true, q might or might not have
finished its registration process. If it did not, the initiator
would have been notified by an exception, or would have timed
out q. In either case, the initiator should have decided to
abort t. This conflicts with the fact that p has committed t,
because that implies that the initiator has asked the
coordinator replicas to commit t. If q completed the
registration process, its registration record should have been
known to a set R7 of at least f + 1 correct replicas. Since p
has committed t, at least one correct replica has ba-committed t
with a commit decision, which in turn implies that a set R8 of
at least 2f + 1 coordinator replicas have accepted a
ba-pre-prepare message with a decision certificate that either
has no record of q in its registration records or lacks q's
"prepared" vote. Since there are 3f + 1 replicas, R7 and R8 must
intersect in at least one replica. This correct replica could
not possibly have accepted a ba-pre-prepare message with a
decision certificate as described above.

For the second scenario, at least one correct replica has
decided to abort t. Since another participant p committed t, at
least one correct replica has decided to commit t. This
contradicts claim 2, which we have proved to be true. Therefore,
claim 3 is correct.

4. Implementation and Performance 

We have implemented the BFTDC protocol (with the exception
of the view change mechanisms) and integrated it into a
distributed commit framework for Web services in the Java
programming language. The extended framework is based on a
number of Apache Web services projects, including Kandula (an
implementation of WS-AT), WSS4J (an implementation of the Web
Services Security Specification) [3], and Apache Axis (a SOAP
engine). Most of the mechanisms are implemented as Axis
handlers that can be plugged into the framework without
affecting other components. Some of the Kandula code is
modified to enable control of its internal state, to enable
Byzantine agreement on the transaction outcome, and to enable
voting. Due to space constraints, the implementation details
are omitted.

For performance evaluation, we focus on assessing the 
runtime overhead of our BFTDC protocol during normal 
operations. Our experiment is carried out on a testbed con- 
sisting of 20 Dell SC1420 servers connected by a 100Mbps 
Ethernet. Each server is equipped with two Intel Xeon 
2.8GHz processors and 1GB memory running SuSE 10.2 
Linux. 

The test application is a simple banking Web services 
application where a bank manager (i.e., initiator) transfers 
funds among the participants within the scope of a dis- 
tributed transaction for each request received from a client. 
The coordinator-side services are replicated on 4 nodes to 
tolerate a single Byzantine faulty replica. The initiator and 
other participants are not replicated, and run on distinct 
nodes. The clients are distributed evenly (whenever pos- 
sible) among the remaining nodes. Each client invokes a 
fund transfer operation on the banking Web service within 
a loop without any "think" time between two consecutive 
calls. In each run, 1000 samples are obtained. The end- 
to-end latency for the fund transfer operation is measured 
at the client. The latency for the distributed commit and 
the Byzantine agreement is measured at the coordinator 
replicas. Finally, the throughput of the distributed commit
framework is measured at the initiator for various numbers
of participants and concurrent clients.
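
A closed-loop client of the kind described above can be sketched as follows. This is only an illustrative Python outline, not the measurement harness used in the experiments: run_closed_loop and summarize are hypothetical names, operation stands in for the fund transfer invocation, and the throughput computed here is the single-client rate rather than the aggregate rate measured at the initiator.

```python
import statistics
import time


def run_closed_loop(operation, samples=1000):
    """Invoke `operation` back to back (no think time) and record each latency in ms."""
    latencies = []
    for _ in range(samples):
        start = time.perf_counter()
        operation()
        latencies.append((time.perf_counter() - start) * 1000.0)
    return latencies


def summarize(latencies):
    """Mean latency in ms and the single-client throughput in calls per second."""
    total_s = sum(latencies) / 1000.0
    return {
        "mean_ms": statistics.mean(latencies),
        "throughput_tps": len(latencies) / total_s if total_s > 0 else float("inf"),
    }
```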

To evaluate the runtime overhead of our protocol, we 
compare the performance of our BFTDC protocol with the 
2PC protocol as it is implemented in the WS-AT framework 
with the exception that all messages exchanged over the net- 
work are digitally signed. 

In Figure 2(a), we include the distributed commit latency
and the end-to-end latency for both our protocol (indicated by
"with bft") and the original 2PC protocol (indicated by "no
bft"). The Byzantine agreement latency is also shown. Figure
2(b) shows the throughput measurement results for transactions
using our protocol with up to 10 concurrently running clients
and 2-10 participants in each transaction. For comparison, the
throughput for transactions using the 2PC protocol for 2
participants is also included.

[Figure 2, panel (a): latency versus number of participants in
each transaction; series: Byzantine Agreement Latency,
Distributed Commit Latency (no bft), Distributed Commit Latency
(with bft), End-to-End Latency (no bft), End-to-End Latency
(with bft). Panel (b): throughput versus number of concurrent
clients; one series per participant count, plus "2 Participants
(no bft)".]

Figure 2. (a) Various latency measurements for transactions with
different numbers of participants under normal operations (with
a single client). (b) Throughput of the distributed commit
service in terms of transactions per second for transactions
with different numbers of participants under different loads.

As can be seen in Figure 2(a), the latency for the distributed
commit and the end-to-end latency are both increased by about
200-400 ms when the number of participants varies from 2 to 10.
This increase is mostly attributed to the introduction of the
Byzantine agreement phase in our protocol. Percentage-wise, the
end-to-end latency, as perceived by an end user, is increased by
only 20% to 30%, which is quite moderate. We observe a similar
range of throughput reductions for transactions using our
protocol, as shown in Figure 2(b).

5. Conclusions 

In this paper, we presented a Byzantine fault tolerant
distributed commit protocol. We carefully studied the types of
Byzantine faults that might occur in a distributed transactional
system and identified the subset of faults that a distributed
commit protocol can handle. We adapted Castro and Liskov's BFT
algorithm to ensure Byzantine agreement on the outcome of
transactions. We also proved informally the correctness of our
BFTDC protocol. A working prototype of the protocol is built on
top of an open source distributed commit framework for Web
services. The measurement results of our protocol show only
moderate runtime overhead. We are currently working on the
implementation of the view change mechanisms and exploring
additional mechanisms to protect a TP monitor against Byzantine
faults, not only for distributed commit, but for activation,
registration, and transaction propagation as well.